1. Exploratory Data Analysis of The Vancouver Street Trees Dataset

This report was prepared by Sarah McDonald on December 12, 2021, as the final project for a Data Visualization class at the University of British Columbia using a subset of the Vancouver Street Trees Data [] provided.

../_images/tree-vancouver.jpg

Fig. 1.1 Street trees in Vancouver

# Import libraries needed for this analysis
import pandas as pd
import altair as alt
import json

pandas [] is used to handle data, altair [] is a package used for graphing, and json [] is used to create maps.

# Load in the data and view a subset
trees_url = 'https://raw.githubusercontent.com/UBC-MDS/data_viz_wrangled/main/data/Trees_data_sets/small_unique_vancouver.csv'
trees_df = pd.read_csv(trees_url, parse_dates=['date_planted'])
trees_df.head()
Unnamed: 0 std_street on_street species_name neighbourhood_name date_planted diameter street_side_name genus_name assigned ... plant_area curb tree_id common_name height_range_id on_street_block cultivar_name root_barrier latitude longitude
0 10747 W 20TH AV W 20TH AV PLATANOIDES Riley Park 2000-02-23 28.5 EVEN ACER N ... 15 Y 21421 NORWAY MAPLE 4 0 NaN N 49.252711 -123.106323
1 12573 W 18TH AV W 18TH AV CALLERYANA Arbutus-Ridge 1992-02-04 6.0 ODD PYRUS N ... 7 Y 129645 CHANTICLEER PEAR 2 2300 CHANTICLEER N 49.256350 -123.158709
2 29676 ROSS ST ROSS ST NIGRA Sunset NaT 12.0 ODD PINUS N ... 7 Y 154675 AUSTRIAN PINE 4 7800 NaN N 49.213486 -123.083254
3 8856 DOMAN ST DOMAN ST AMERICANA Killarney 1999-11-12 11.0 EVEN FRAXINUS N ... 7 Y 180803 AUTUMN APPLAUSE ASH 4 6900 AUTUMN APPLAUSE N 49.220839 -123.036721
4 21098 EAST BOULEVARD EAST BOULEVARD HIPPOCASTANUM Shaughnessy NaT 15.5 ODD AESCULUS Y ... N Y 74364 COMMON HORSECHESTNUT 4 5200 NaN N 49.238514 -123.154958

5 rows × 21 columns

# get more information about our dataset
trees_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 21 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   Unnamed: 0          5000 non-null   int64         
 1   std_street          5000 non-null   object        
 2   on_street           5000 non-null   object        
 3   species_name        5000 non-null   object        
 4   neighbourhood_name  5000 non-null   object        
 5   date_planted        2363 non-null   datetime64[ns]
 6   diameter            5000 non-null   float64       
 7   street_side_name    5000 non-null   object        
 8   genus_name          5000 non-null   object        
 9   assigned            5000 non-null   object        
 10  civic_number        5000 non-null   int64         
 11  plant_area          4950 non-null   object        
 12  curb                5000 non-null   object        
 13  tree_id             5000 non-null   int64         
 14  common_name         5000 non-null   object        
 15  height_range_id     5000 non-null   int64         
 16  on_street_block     5000 non-null   int64         
 17  cultivar_name       2658 non-null   object        
 18  root_barrier        5000 non-null   object        
 19  latitude            5000 non-null   float64       
 20  longitude           5000 non-null   float64       
dtypes: datetime64[ns](1), float64(3), int64(5), object(12)
memory usage: 820.4+ KB

2. Questions of Interest

For this analysis I am interested in how the number and type of trees planted has changed over time. From our initial look at the data, I can see that a lot of values are missing from the ‘date_planted’ column. This could be an error in data recording or it could be that we don’t have records of when older trees were planted. To visualize the gaps in our data, let’s first plot the dates we do have.

# rug plot to visualize date_planted column data
trees_date = alt.Chart(trees_df).mark_tick().encode(
             alt.X("date_planted:T", scale=alt.Scale())
             )

trees_date

It looks like we have continuous data from 1989-2019. If our theory is correct and data without values in the ‘date_planted’ column is from older trees, we could expect these trees to be larger than trees planted more recently. Let’s see if that holds true for our data.

To make the data easier to filter, I will use the pandas [] package to add a new column to our data frame. A simple boolean, date_record, flags entries whose planting date is missing: it is True when ‘date_planted’ is empty and False when a date is available.

# add a boolean column to our dataframe: True where date_planted is missing
trees_nan = trees_df.assign(date_record = trees_df.isna().loc[:, 'date_planted'])
trees_nan.head()
Unnamed: 0 std_street on_street species_name neighbourhood_name date_planted diameter street_side_name genus_name assigned ... curb tree_id common_name height_range_id on_street_block cultivar_name root_barrier latitude longitude date_record
0 10747 W 20TH AV W 20TH AV PLATANOIDES Riley Park 2000-02-23 28.5 EVEN ACER N ... Y 21421 NORWAY MAPLE 4 0 NaN N 49.252711 -123.106323 False
1 12573 W 18TH AV W 18TH AV CALLERYANA Arbutus-Ridge 1992-02-04 6.0 ODD PYRUS N ... Y 129645 CHANTICLEER PEAR 2 2300 CHANTICLEER N 49.256350 -123.158709 False
2 29676 ROSS ST ROSS ST NIGRA Sunset NaT 12.0 ODD PINUS N ... Y 154675 AUSTRIAN PINE 4 7800 NaN N 49.213486 -123.083254 True
3 8856 DOMAN ST DOMAN ST AMERICANA Killarney 1999-11-12 11.0 EVEN FRAXINUS N ... Y 180803 AUTUMN APPLAUSE ASH 4 6900 AUTUMN APPLAUSE N 49.220839 -123.036721 False
4 21098 EAST BOULEVARD EAST BOULEVARD HIPPOCASTANUM Shaughnessy NaT 15.5 ODD AESCULUS Y ... Y 74364 COMMON HORSECHESTNUT 4 5200 NaN N 49.238514 -123.154958 True

5 rows × 22 columns
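The flag also gives a quick way to measure how much of the date data is missing. In the full data set, info() reported 2363 non-null planting dates out of 5000, so 2637 entries (about 53%) have no date. Below is a minimal sketch of the same idea on a made-up data frame; the rows are hypothetical and only the column names come from our data.

```python
import pandas as pd

# Toy stand-in for trees_df; the values are made up for illustration.
toy = pd.DataFrame({
    "tree_id": [1, 2, 3, 4],
    "date_planted": pd.to_datetime(["2000-02-23", None, "1999-11-12", None]),
})

# isna() is True where the date is missing, so the flag reads "date is missing".
toy = toy.assign(date_record=toy["date_planted"].isna())

# The mean of a boolean column is the share of True values.
missing_share = toy["date_record"].mean()
print(toy["date_record"].tolist(), missing_share)
```

Applied to trees_nan, the same mean() call should reproduce the share worked out from the info() output.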

To account for differences in species we want to break the records down by species. First let’s see how many species we are working with.

species = trees_nan.groupby("species_name")
species.describe()
Unnamed: 0 diameter ... latitude longitude
count mean std min 25% 50% 75% max count mean ... 75% max count mean std min 25% 50% 75% max
species_name
ABIES 3.0 11484.666667 7736.631718 4347.0 7374.00 10401.0 15053.5 19706.0 3.0 16.000000 ... 49.251689 49.265250 3.0 -123.139497 0.082919 -123.191800 -123.187300 -123.182800 -123.113346 -123.043891
ACERIFOLIA X 60.0 14736.833333 7736.247569 1152.0 8729.25 12926.0 21225.0 29978.0 60.0 22.355000 ... 49.263235 49.289708 60.0 -123.117517 0.047075 -123.198230 -123.150238 -123.122816 -123.078775 -123.030066
ACUTISSIMA 19.0 16161.631579 8395.660984 2483.0 11159.00 16611.0 23396.5 28798.0 19.0 11.355263 ... 49.263155 49.285991 19.0 -123.087162 0.038076 -123.166011 -123.113721 -123.089016 -123.058271 -123.028403
ALNIFOLIA 7.0 19888.285714 6129.725299 11189.0 15721.50 21692.0 24053.0 26788.0 7.0 7.642857 ... 49.271948 49.290517 7.0 -123.086372 0.052193 -123.157361 -123.132622 -123.055624 -123.044008 -123.038358
ALPINUM 1.0 7160.000000 NaN 7160.0 7160.00 7160.0 7160.0 7160.0 1.0 8.000000 ... 49.261980 49.261980 1.0 -123.176110 NaN -123.176110 -123.176110 -123.176110 -123.176110 -123.176110
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
WATERERI X 3.0 10674.000000 4363.166396 7523.0 8184.00 8845.0 12249.5 15654.0 3.0 18.833333 ... 49.247132 49.258560 3.0 -123.137358 0.067524 -123.209370 -123.168305 -123.127239 -123.101351 -123.075464
X YEDOENSIS 90.0 16544.900000 8492.142408 832.0 9711.75 17409.5 22845.5 29792.0 90.0 7.547222 ... 49.256834 49.289456 90.0 -123.117258 0.057906 -123.220360 -123.166355 -123.130054 -123.058314 -123.025868
XX 57.0 16790.192982 9060.538799 397.0 8913.00 18314.0 25920.0 29855.0 57.0 3.504386 ... 49.261244 49.289050 57.0 -123.097158 0.050363 -123.209720 -123.137614 -123.088452 -123.060023 -123.023650
YUNNANENSIS 1.0 5188.000000 NaN 5188.0 5188.00 5188.0 5188.0 5188.0 1.0 10.000000 ... 49.220989 49.220989 1.0 -123.100972 NaN -123.100972 -123.100972 -123.100972 -123.100972 -123.100972
ZUMI 65.0 13045.923077 8931.664130 33.0 5494.00 9988.0 21052.0 29456.0 65.0 5.203846 ... 49.264133 49.285638 65.0 -123.101302 0.058736 -123.214080 -123.157532 -123.082970 -123.055951 -123.026981

171 rows × 64 columns

3. Top 10

171 is a lot of species to visualize all at once. Let’s find our top 10 using the pandas package [] to group entries by their common name, count those entries, and sort from most to least common. Then we can filter so we see only the first 10 entries, the 10 most common trees planted!

#find the 10 most common trees in our dataset
trees_common = (trees_nan.groupby("common_name").count().sort_values(by='tree_id', ascending=False
                ).reset_index().loc[0:9])
trees_common = trees_common["common_name"].tolist()
trees_common
['KWANZAN FLOWERING CHERRY',
 'PISSARD PLUM',
 'NORWAY MAPLE',
 'CRIMEAN LINDEN',
 'PYRAMIDAL EUROPEAN HORNBEAM',
 'NIGHT PURPLE LEAF PLUM',
 'KOBUS MAGNOLIA',
 'AKEBONO FLOWERING CHERRY',
 'RED MAPLE',
 'KATSURA TREE']
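A shorter way to get a similar list is value_counts(), which counts and sorts in one step. This is only a sketch on made-up rows, not the code used for the list above, and the two approaches may order tied counts differently.

```python
import pandas as pd

# Hypothetical rows; only the column name comes from our data.
toy = pd.DataFrame({
    "common_name": [
        "NORWAY MAPLE", "RED MAPLE", "NORWAY MAPLE",
        "KATSURA TREE", "NORWAY MAPLE", "RED MAPLE",
    ]
})

# value_counts() sorts from most to least common; head(n) keeps the first n.
top_n = 2
top_names = toy["common_name"].value_counts().head(top_n).index.tolist()
print(top_names)
```

Note that .loc[0:9] in the original cell is label-based and includes both ends, which is why it returns 10 rows after reset_index().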
# filter trees_nan to include only the most common trees
common_records = trees_nan.common_name.isin(trees_common)
trees_nan_small = trees_nan[common_records]
# box plots of tree diameter per species (most common), split by date_record
tree_diam = alt.Chart(trees_nan_small).mark_boxplot().encode(
            alt.X('diameter:Q'),
            alt.Y('common_name:N'),
            ).properties(width=300).facet('date_record')
tree_diam
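The medians drawn in the box plots can also be computed as numbers. Here is a minimal sketch on a made-up data frame (the rows are hypothetical, only the column names come from our data). Remember that date_record is True when the planting date is missing.

```python
import pandas as pd

# Hypothetical diameters for two species, with and without a planting date.
toy = pd.DataFrame({
    "common_name": ["NORWAY MAPLE"] * 4 + ["RED MAPLE"] * 4,
    "diameter": [10.0, 14.0, 20.0, 30.0, 4.0, 6.0, 9.0, 15.0],
    "date_record": [False, False, True, True] * 2,
})

# Median diameter for every (species, date_record) pair.
medians = toy.groupby(["common_name", "date_record"])["diameter"].median()
print(medians)
```

Running the same groupby on trees_nan_small would give one median per species and flag value to compare against the chart.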

As we can see from the chart above, trees without a date record do have a higher median diameter than trees with a date record. Trees increase in circumference as they age. A general formula for estimating the age of a tree is the diameter of the tree (its circumference C divided by π) multiplied by a growth factor G specific to the species.[]

\[ age \approx \frac{C}{\pi} \times G \]
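As a worked example of the formula, here is a small sketch. The function name, the circumference, and the growth factor of 4.5 are all hypothetical values chosen for illustration; real growth factors depend on the species and on the unit used for the diameter.

```python
import math


def estimate_age(circumference, growth_factor):
    """Estimate tree age as (C / pi) * G."""
    diameter = circumference / math.pi  # diameter from circumference
    return diameter * growth_factor


# A trunk with a circumference of 20 * pi has a diameter of 20.
# With an assumed growth factor of 4.5 the estimate is about 90 years.
example_age = estimate_age(circumference=20 * math.pi, growth_factor=4.5)
print(round(example_age, 1))
```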

Our theory that trees without date records are older seems to be correct, so we will exclude these values from future plots regarding date. To make analysis easier, I will add a column with just the year planted.

# remove entries with no date_planted
trees_small = trees_df.dropna(subset=['date_planted'])
# create a new column with just year 
trees_small = trees_small.assign(year_planted = trees_small['date_planted'].dt.year)
# number of trees planted over time
trees_time = alt.Chart(trees_small).mark_bar().encode(
             alt.X('year_planted:O'),
             alt.Y('count()'))
trees_time
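The counts behind the bar chart can also be tabulated directly in pandas. This is a minimal sketch on made-up dates, assuming the same two steps as above: drop the rows without a date, then pull the year out of the timestamp.

```python
import pandas as pd

# Hypothetical planting dates, including one missing value.
toy = pd.DataFrame({
    "date_planted": pd.to_datetime(
        ["1999-11-12", "1999-03-01", None, "2000-02-23"]
    ),
})

# Drop rows without a date, then count trees per planting year.
dated = toy.dropna(subset=["date_planted"])
per_year = dated["date_planted"].dt.year.value_counts().sort_index()
print(per_year)
```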

4. Click to filter

Let’s make this chart clickable so we can filter our top 10 tree species by year.

click_year = alt.selection_multi(encodings=['x'], on='click')
click_trees_year = (trees_time.encode(
                   opacity=alt.condition(click_year, alt.value(1), alt.value(0.5)))
                  .properties(height=100, width=500)
                  .add_selection(click_year))
# select 10 most common trees based on year
species_select = (alt.Chart(trees_small).transform_filter(click_year).mark_bar().encode(
                    alt.Y('species_name:N', sort='x'),
                    alt.X('species_count:Q'),
                    ).transform_aggregate(
                    species_count="count()",
                    groupby=["species_name"]
                    ).transform_window(
                    rank='rank(species_count)',
                    sort=[alt.SortField("species_count", order="descending")]
                    ).transform_filter((alt.datum.rank <= 10)).add_selection(click_year))
species_select & click_trees_year
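The chart above chains three transforms: count the trees per species, rank the counts, and keep ranks up to 10. A rough pandas equivalent for a single year is sketched below on made-up rows. The helper top_species is hypothetical, and method="min" is my assumption for how ties should be ranked (tied species share the lowest rank), which I believe matches the rank window operation used in the chart.

```python
import pandas as pd

# Hypothetical plantings; only the column names come from our data.
toy = pd.DataFrame({
    "year_planted": [2004] * 6 + [2005] * 3,
    "species_name": [
        "RUBRUM", "RUBRUM", "RUBRUM", "ZUMI", "ZUMI", "NIGRA",
        "ZUMI", "ZUMI", "NIGRA",
    ],
})


def top_species(df, year, n):
    """Return the n most planted species for one year (hypothetical helper)."""
    selected = df[df["year_planted"] == year]
    # aggregate: one row per species with its count
    counts = (selected.groupby("species_name").size()
              .rename("species_count").reset_index())
    # window: rank the counts, largest first; ties share the lowest rank
    counts["rank"] = counts["species_count"].rank(method="min", ascending=False)
    # filter: keep only the top n ranks
    top = counts[counts["rank"] <= n]
    return top.sort_values("species_count", ascending=False)


top_2004 = top_species(toy, 2004, n=2)
print(top_2004)
```

Because tied species share a rank, a filter on rank can return more than n rows when several species have the same count.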

Interesting, there is less overlap in the top 10 species per year than I thought there would be. Now, I would like to look more at the size of trees. I wonder how the method of planting affects a tree’s size. To visualize this, I will use our top 10 data subset.

# Tree diameter vs height colored by species
tree_height = alt.Chart(trees_nan_small).mark_circle().encode(
              alt.X('diameter:Q'),
              alt.Y('height_range_id:Q'),
              color='species_name:N'
              )
tree_height
# facet our size chart by root barrier
tree_height.facet('root_barrier:N')
# facet tree size by side of street
tree_side = tree_height.properties(width=200).facet('street_side_name')
tree_side

5. Barriers to barriers

It looks like the side of the street that trees are planted on makes no difference to size. However, trees planted with a root barrier do seem to be smaller. Let’s see if the trees with root barriers are younger than those without, using our full dataset.

../_images/root-barrier.jpg

Fig. 5.1 An example of a root barrier. These are used to prevent tree roots from damaging the sidewalk.

root_barrier = trees_time.encode(color="root_barrier:N")
root_barrier

It looks like most of the trees with root barriers were planted between 2004 and 2009. Let’s filter our data to include just those years and see if the pattern still holds.

tree_height_filter = alt.Chart(trees_small).transform_filter(
                   alt.FieldRangePredicate(field='year_planted', range=[2004, 2009])
                   ).mark_circle().encode(
                   alt.X('diameter:Q'),
                   alt.Y('height_range_id:Q')
                   ).properties(width=300).facet('root_barrier:N')
tree_height_filter

When we filter for just the years in which root barriers were commonly used, the size difference is much less pronounced. Our initial observation may have arisen because trees with root barriers make up only a small share of the data and were mostly planted in recent years, so they have had less time to grow.
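One way to check how the share of trees with root barriers changes when we restrict the years is sketched below. The rows are made up for illustration; only the column names and the 2004 to 2009 range come from our data. Series.between includes both end points by default.

```python
import pandas as pd

# Hypothetical plantings; "Y" marks a tree planted with a root barrier.
toy = pd.DataFrame({
    "year_planted": [1995, 1998, 2004, 2005, 2006, 2009, 2012, 2015],
    "root_barrier": ["N", "N", "Y", "N", "Y", "Y", "N", "N"],
})

# Share of trees with a root barrier across all years.
overall_share = (toy["root_barrier"] == "Y").mean()

# Same share, restricted to trees planted from 2004 to 2009 inclusive.
in_range = toy[toy["year_planted"].between(2004, 2009)]
range_share = (in_range["root_barrier"] == "Y").mean()

print(overall_share, range_share)
```

Running the same two lines on trees_small would show how unbalanced the two groups are before and after filtering.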

Now, let’s see how the trees are distributed over Vancouver. As part of this course, code was provided to create a base map of Vancouver.

# load data to make a map of vancouver (code provided)
url_geojson = 'https://raw.githubusercontent.com/UBC-MDS/exploratory-data-viz/main/data/local-area-boundary.geojson'
data_geojson_remote = alt.Data(url=url_geojson, format=alt.DataFormat(property='features',type='json'))
data_geojson_remote
Data({
  format: DataFormat({
    property: 'features',
    type: 'json'
  }),
  url: 'https://raw.githubusercontent.com/UBC-MDS/exploratory-data-viz/main/data/local-area-boundary.geojson'
})
# base map of Vancouver (code provided)
vancouver_map = alt.Chart(data_geojson_remote).mark_geoshape(
    color = 'white', opacity= 0.5, stroke='black').encode(
).project(type='identity', reflectY=True)

vancouver_map
# Map location of all trees with a planting date (trees_small)
points = alt.Chart(trees_small).mark_circle(size=20).encode(
         longitude='longitude',
         latitude='latitude',
         ).project(type= 'identity', reflectY=True)

point_map = (vancouver_map + points)
point_map

To see how the distribution changes over time I am going to use the clickable year chart we made earlier.

point_map = point_map.encode(
                opacity=alt.condition(click_year, alt.value(1), alt.value(0.1)),
                color="species_name:N"
                ).add_selection(click_year)
point_map & click_trees_year

6. Conclusion

Interesting, over the years the distribution seems to be spread out evenly. I would have guessed that the street tree program would have started in a few neighbourhoods and branched out from there. There also don’t seem to be any clusters of particular species in neighbourhoods, but it is hard to tell with so many species to consider. For the analysis report, I think it will be interesting to explore the distribution of species planted over time and space using both time charts and a map. Linking our top 10 species per year chart will make the species distribution much easier to visualize. I am also very interested in our findings about the size of trees and root barriers, so I will include those in our report as well.